Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import course staff management command #255

Merged
merged 1 commit into from
Apr 2, 2024

Conversation

varshamenon4
Copy link
Member

JIRA: COSMO-200

Description: Create a management command to populate the CourseStaffRole model and any related User objects with data from the LMS. Once merged create an ESRE ticket to upload the CSV file and run the command.

Copy link
Member Author

@varshamenon4 varshamenon4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to put this draft out for initial review to get feedback and to make sure I'm on the right track. Also, I'm guessing for testing that I can just create a CSV and test locally that this populates the db in django admin?

LOGGER_NAME = 'edx_exams.apps.core.management.commands.bulk_add_course_staff'


class TestBulkAddCourseStaff(TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth adding a test here that ensures this command produces the expected number or queries w/ assertNumQueries. example:

with self.assertNumQueries(4):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense! When I used this method, I see that there are a number of calls: https://github.com/edx/edx-exams/actions/runs/8394389300/job/22991407617?pr=255. Is it better to just check relative number of queries (for example, that batch size affects the number of queries for example) or is there a better way to finetune? I couldn't figure out a good way to constrain to only a certain type of query (for example, just the bulk query). Was poking around here: https://docs.djangoproject.com/en/5.0/topics/testing/tools/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably just make this it's own test rather than baking it into the others as an extra assertion. For that one I'd try a bunch of extra rows to ensure the query count has not increased in an unexpected way based on more data.

For getting the expected number you should be able to walk though the code and count things up for what you expect. If it's different, it would be good to know why or where the extra query is coming from. There's a few ways to debug django queries but an easy one is to use

from django.db import connection
print(connection.queries)

@varshamenon4 varshamenon4 force-pushed the varshamenon4/import-course-staff-mgmt-cmd branch from 0d5b026 to 901ebb0 Compare March 22, 2024 17:36
@varshamenon4 varshamenon4 marked this pull request as ready for review March 22, 2024 17:36
Copy link
Member Author

@varshamenon4 varshamenon4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready for review! Added some specific questions about batch testing and also testing using assertNumQueries.

parser.add_argument(
'--batch_size',
type=int,
default=10000,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a good default size?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a default I think maybe we keep the number lower in the hundreds or 1k max. Not sure when we'd start running into memory issues with a really big bulk create query. You could play with this locally to get an estimate however.

Copy link
Contributor

@zacharis278 zacharis278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still looking over things but a few comments to start

parser.add_argument(
'--batch_size',
type=int,
default=10000,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a default I think maybe we keep the number lower in the hundreds or 1k max. Not sure when we'd start running into memory issues with a really big bulk create query. You could play with this locally to get an estimate however.

LOGGER_NAME = 'edx_exams.apps.core.management.commands.bulk_add_course_staff'


class TestBulkAddCourseStaff(TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably just make this it's own test rather than baking it into the others as an extra assertion. For that one I'd try a bunch of extra rows to ensure the query count has not increased in an unexpected way based on more data.

For getting the expected number you should be able to walk though the code and count things up for what you expect. If it's different, it would be good to know why or where the extra query is coming from. There's a few ways to debug django queries but an easy one is to use

from django.db import connection
print(connection.queries)

csv = self._write_test_csv(csv, lines)
with self.assertNumQueries(2):
call_command(self.command, f'--csv_path={csv.name}')
assert CourseStaffRole.objects.filter(course_id='course-v1:edx+test+f20').count() == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to test if the correct user/role was created here rather than assert any role exists. Same for the below tests

@varshamenon4 varshamenon4 force-pushed the varshamenon4/import-course-staff-mgmt-cmd branch 2 times, most recently from 38b57fe to b0a9515 Compare March 28, 2024 12:53
Copy link
Member

@MichaelRoytman MichaelRoytman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! I left a few, hopefully quick, questions for you.

"""
Add the given set of course staff provided in csv
"""
reader = list(unicodecsv.DictReader(csv_file))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What made you choose the unicodecsv package? I ask because Python's native csv module has a DictReader class that should provide the same functionality. Looking at the docs for unicodecsv, it looks like it improves upon some unicode issues in Python 2, but we're on Python 3 here, so I'd imagine we don't benefit from unicodecsv. Where possible, I think it would be better to leverage native packages. I think this should be an easy swap without any impact to the command or its tests. What do you think?

As a nit, at least for the csv.DictReader class, the class is iterable, so you don't have to cast the value to a list; you can just iterate the instance. Maybe unicodecsv.DictReader is different? Oh, I see. You're casting it to a list because you're iterating reader twice and don't want to exhaust the iterable, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point! I've changed to use csv. The only reason I used unicode was because I had referenced code from elsewhere... so not really a good reason haha.

I primarily cast to a list because I check the length of the reader (in order to batch the creates for the course staff). Lemme know if this is a bad idea though!

for i in range(0, len(reader), batch_size):
CourseStaffRole.objects.bulk_create(
CourseStaffRole(
user=User.objects.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the answer to this question, and maybe this is not possible. But I noticed that you create the users_existing set querying the User model via the username field. Here, though, you're querying the User model via the username and the email field. I'm wondering whether a scenario is possible where a user's would be found in users_existing but not in this query.

If we're pulling this data from the LMS, I think this could happen, because email is modifiable, but username isn't. I think email is maintained on our User model via syncing with the JWT when they hit edx-exams in the process of an exam, right? Do you think this is something we need to address here? I'm just not sure this code guarantees a user will exist, which could cause an exception here.

Since the function is in a transaction, it should be okay, because it'll rollback, we can correct the CSV, and try again, but it would be better to handle explicitly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, that's a really good point. Since the username is unique, could I just get by username instead? I've updated to be that way but let me know if that could cause any issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's good! username should not change.

@varshamenon4 varshamenon4 force-pushed the varshamenon4/import-course-staff-mgmt-cmd branch from 5c86af2 to e4f5931 Compare March 29, 2024 16:08
Copy link
Member

@MichaelRoytman MichaelRoytman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you for incorporating my feedback.

@varshamenon4 varshamenon4 force-pushed the varshamenon4/import-course-staff-mgmt-cmd branch from 87fdab0 to 195a50e Compare April 2, 2024 16:53
Copy link
Contributor

@zacharis278 zacharis278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@varshamenon4 varshamenon4 merged commit cae41d6 into main Apr 2, 2024
7 checks passed
@varshamenon4 varshamenon4 deleted the varshamenon4/import-course-staff-mgmt-cmd branch April 2, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants